Midterm project for the DATA624 course in CUNY's MSDS program
First I load the provided data and take a look at the first few rows. I see there is an unnecessary hour/minute/second component on the DATE variable, which I'll deal with when parsing the dates.
atm_raw <- read_csv("atm_raw.csv")
head(atm_raw)
## # A tibble: 6 x 3
## DATE ATM Cash
## <chr> <chr> <dbl>
## 1 5/1/2009 12:00:00 AM ATM1 96
## 2 5/1/2009 12:00:00 AM ATM2 107
## 3 5/2/2009 12:00:00 AM ATM1 82
## 4 5/2/2009 12:00:00 AM ATM2 89
## 5 5/3/2009 12:00:00 AM ATM1 85
## 6 5/3/2009 12:00:00 AM ATM2 90
#start cleaned dataset
atm <- atm_raw
#parse DATE; the trailing "12:00:00 AM" is ignored once the format is matched
atm$DATE <- as.Date(atm$DATE, format = '%m/%d/%Y')
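As a quick sanity check on the parsing: `as.Date()` uses strptime-style matching, which ignores any text remaining after the format string is exhausted, so a plain `%m/%d/%Y` format is enough to handle these timestamps.

```r
# as.Date() stops reading once the format is matched,
# so the "12:00:00 AM" suffix can be left in place
d <- as.Date("5/1/2009 12:00:00 AM", format = "%m/%d/%Y")
d
#> [1] "2009-05-01"
```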
At first glance I see there is some work to do. There are 14 cases with no value for ATM; these can be removed, as they are associated with May 2010 dates (the period we'll be predicting later) and carry no Cash value. I also see that ATM1 and ATM2 each have a couple of missing Cash values. ATM3 appears to have a strange distribution, with nearly all percentiles equal to zero.
atm %>%
group_by(ATM) %>%
skim()
| Name | Piped data |
| Number of rows | 1474 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | ATM |
Variable type: Date
| skim_variable | ATM | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|---|
| DATE | ATM1 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM2 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM3 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM4 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | NA | 0 | 1 | 2010-05-01 | 2010-05-14 | 2010-05-07 | 14 |
Variable type: numeric
| skim_variable | ATM | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cash | ATM1 | 3 | 0.99 | 83.89 | 36.66 | 1 | 73.0 | 91 | 108 | 180 | ▂▁▇▃▁ |
| Cash | ATM2 | 2 | 0.99 | 62.58 | 38.90 | 0 | 25.5 | 67 | 93 | 147 | ▇▅▇▇▂ |
| Cash | ATM3 | 0 | 1.00 | 0.72 | 7.94 | 0 | 0.0 | 0 | 0 | 96 | ▇▁▁▁▁ |
| Cash | ATM4 | 0 | 1.00 | 474.01 | 650.95 | 2 | 124.0 | 404 | 705 | 10920 | ▇▁▁▁▁ |
| Cash | NA | 14 | 0.00 | NaN | NA | NA | NA | NA | NA | NA |
I drop the cases where the ATM variable is NA, as these also had empty Cash values.
#remove cases with ATM=NA
atm <- atm %>%
drop_na(ATM)
Let's take a closer look at ATM3 cash values. It appears there are 362 days with zero Cash withdrawn, and one day each at 82, 85, and 96 (Cash is in hundreds of dollars, so $8,200, $8,500, and $9,600).
atm %>%
filter(ATM == "ATM3") %>%
group_by(Cash) %>%
summarize(n = n())
## # A tibble: 4 x 2
## Cash n
## <dbl> <int>
## 1 0 362
## 2 82 1
## 3 85 1
## 4 96 1
Looking at the only cases where ATM3 has Cash greater than zero, I see that these are the last 3 days of the time series. This suggests the machine was previously broken or was only installed around 4/28/2010, which would explain the lack of data for the majority of the period. As a prediction based on only 3 data points would not be wise, I'll disregard ATM3 moving forward and relay to my boss that we'll need more data before we can forecast for it.
atm %>%
filter(ATM == "ATM3",
Cash > 0)
## # A tibble: 3 x 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-04-28 ATM3 96
## 2 2010-04-29 ATM3 82
## 3 2010-04-30 ATM3 85
Finally, I'll split each ATM into its own data frame.
#ensure Cash is a plain numeric vector
atm$Cash <- as.numeric(atm$Cash)
atm1 <- atm %>%
filter(ATM == "ATM1")
atm2 <- atm %>%
filter(ATM == "ATM2")
atm4 <- atm %>%
filter(ATM == "ATM4")
Next I'll see how the imputeTS package would impute the handful of NA values using the linear (default) imputation method.
The red values it suggests appear reasonable to my eye, and considering they are a very small proportion of the full dataset (0.8%), this is acceptable. (A seasonally decomposed imputation via na_seadec() would be another option worth considering.)
#store imputation
imputation <- na_interpolation(atm1$Cash)
#plot imputation
ggplot_na_imputations(atm1$Cash, imputation) +
my_plot_theme
#incorporate imputation
atm1$Cash <- imputation
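For intuition on what the default linear method does: it behaves like base R's approx() evaluated at the missing positions. A sketch of the idea with made-up numbers, not the package's actual implementation:

```r
# hypothetical mini-series with one gap
x <- c(96, 82, NA, 90, 85)
idx <- seq_along(x)
filled <- x
# linearly interpolate the known points at the missing index
filled[is.na(x)] <- approx(idx[!is.na(x)], x[!is.na(x)], xout = idx[is.na(x)])$y
filled
#> [1] 96 82 86 90 85
```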
And again for ATM2 we see reasonable values from the default imputation method.
#store imputation
imputation <- na_interpolation(atm2$Cash)
#plot imputation
ggplot_na_imputations(atm2$Cash, imputation) +
my_plot_theme
#incorporate imputation
atm2$Cash <- imputation
I plot each ATM as a boxplot below to get a better look at the distributions. ATM1 and ATM2 seem the most normally distributed, though ATM1 has quite a few outliers on the low end and a higher median than ATM2. Finally, ATM4 appears compressed into a very small range of values; however, taking note of the scale, this is an optics issue caused by one extreme outlier of 10,920 (over $1,000,000, since Cash is in hundreds). I've stumbled on another data cleaning issue.
atm1_box <- ggplot(atm1, aes(x = Cash)) +
geom_boxplot()
atm2_box <- ggplot(atm2, aes(x = Cash)) +
geom_boxplot()
atm4_box <- ggplot(atm4, aes(x = Cash)) +
geom_boxplot()
ggarrange(atm1_box, atm2_box, atm4_box,
labels = c("ATM1", "ATM2", "ATM4"),
ncol = 1) +
my_plot_theme
A nice function from the forecast package, tsoutliers(), lets us see which cases are outliers. It is better suited to time series than a generic outlier check because it decomposes the series and accounts for seasonality and trend before flagging points (see its documentation for details). I'll now convert each ATM data frame into a tsibble object so I can use the forecast package's functionality moving forward.
ATM1 has over 30 outliers, which is consistent with the boxplot above. We also see the recommended replacement values. While these are statistically identified as outliers, there are so many that it would be unwise to remove or replace them. It's good to know they exist, but I won't fiddle with them.
#make tsibble version that works with tsoutliers
atm1 <- as.data.frame(atm1)
atm1_ts <- atm1 %>%
select(-DATE) %>%
mutate(Date = as.Date("2009-05-01") + 0:364) %>%
as_tsibble(index = Date, key = Cash)
#check for outliers
tsoutliers(atm1_ts$Cash)
## $index
## [1] 45 46 47 48 49 50 51 52 53 54 55 56 62 63 64 65 66 67 68
## [20] 69 70 71 72 73 74 352 353 354 355 356 357 358 359 360 361 362 363 364
## [39] 365
##
## $replacements
## [1] 20.46154 21.92308 23.38462 24.84615 26.30769 27.76923 29.23077
## [8] 30.69231 32.15385 33.61538 35.07692 36.53846 51.28571 52.57143
## [15] 53.85714 55.14286 56.42857 57.71429 59.00000 60.28571 61.57143
## [22] 62.85714 64.14286 65.42857 66.71429 138.00000 138.00000 138.00000
## [29] 138.00000 138.00000 138.00000 138.00000 138.00000 138.00000 138.00000
## [36] 138.00000 138.00000 138.00000 138.00000
#make tsibble version that will work with autoplot later
atm1_ts <- as.data.frame(atm1_ts)
atm1_ts <- tsibble(
Date = as.Date("2009-05-01") + 0:364,
Cash = atm1$Cash)
ATM2 has far fewer outliers, only 5. These didn't show up on the boxplot above, but that is likely because the tsoutliers() function flags outliers with respect to seasonality and/or trend, whereas the boxplot is just looking at the raw distribution. While these values are certainly high, I don't believe they are in error, so I won't remove or replace these either.
#make tsibble version that works with tsoutliers
atm2 <- as.data.frame(atm2)
atm2_ts <- atm2 %>%
select(-DATE) %>%
mutate(Date = as.Date("2009-05-01") + 0:364) %>%
as_tsibble(index = Date, key = Cash)
#check for outliers
tsoutliers(atm2_ts$Cash)
## $index
## [1] 361 362 363 364 365
##
## $replacements
## [1] 136 136 136 136 136
#make tsibble version that will work with autoplot later
atm2_ts <- as.data.frame(atm2_ts)
atm2_ts <- tsibble(
Date = as.Date("2009-05-01") + 0:364,
Cash = atm2$Cash)
ATM4 has the highest number of outliers we’ve seen, and it’s unclear from this readout which is the extreme value seen in the boxplot.
#make tsibble version that works with tsoutliers
atm4 <- as.data.frame(atm4)
atm4_ts <- atm4 %>%
select(-DATE) %>%
mutate(Date = as.Date("2009-05-01") + 0:364) %>%
as_tsibble(index = Date, key = Cash)
#check for outliers
tsoutliers(atm4_ts$Cash)
## $index
## [1] 300 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330
## [20] 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349
## [39] 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365
##
## $replacements
## [1] 804 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847
## [20] 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847
## [39] 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847
#make tsibble version that will work with autoplot later
atm4_ts <- as.data.frame(atm4_ts)
atm4_ts <- tsibble(
Date = as.Date("2009-05-01") + 0:364,
Cash = atm4$Cash)
I filter for values where Cash is greater than 10,000 and find one, on 2010-02-09. As this value is so extreme, I'll replace it with the median.
#find outlier
atm4_ts %>%
filter(Cash > 10000)
## # A tsibble: 1 x 2 [1D]
## Date Cash
## <date> <dbl>
## 1 2010-02-09 10920
#set outlier to NA for now
atm4_ts$Cash[atm4_ts$Cash == 10920] <- NA
#calculate median w/o extreme outlier
atm4_median <- median(atm4_ts$Cash, na.rm = TRUE)
#put median in for the NA value (note: x == NA never matches; use is.na())
atm4_ts$Cash[is.na(atm4_ts$Cash)] <- atm4_median
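A base R gotcha is relevant here: comparisons against NA never return TRUE (they return NA), so NA replacement has to go through is.na() rather than ==. A quick illustration with toy numbers:

```r
# == with NA yields NA, not TRUE, so x[x == NA] selects nothing usable
NA == NA
#> [1] NA
x <- c(404, NA, 705)
x[is.na(x)] <- median(x, na.rm = TRUE)
x
#> [1] 404.0 554.5 705.0
```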
I revisit the boxplot and see a few outliers on the high end, but nothing as extreme as I saw earlier.
ggplot(atm4_ts, aes(x = Cash)) +
geom_boxplot() +
my_plot_theme
Now I can proceed looking for trend, seasonality, and/or the need for Box-Cox transformation on each of the ATMs’ time series data.
A plot of the cleaned ATM1 data shows some changes in variance and certainly seasonality.
autoplot(atm1_ts, Cash)
In order to better regulate the variation in the plot above, I’ll perform a Box-Cox transformation. In this case lambda = 0.35. The plot still has plenty of seasonality, but the variance is more stable.
#calculate lambda
lambda1 <- atm1_ts %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
#plug into autoplot
ggplotly(atm1_ts %>%
autoplot(box_cox(Cash, lambda1)) +
labs(y = "",
title = "Transformed ATM1 Cash with lambda = 0.35"))
#apply transformation
atm1_tf <-
atm1_ts %>%
mutate(Cash = box_cox(Cash, lambda1))
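To make the transformation concrete: Box-Cox computes (y^λ − 1)/λ for λ ≠ 0, and log(y) at λ = 0. A hand-rolled version for illustration only (the chunk above uses fabletools' box_cox(); the names bc/bc_inv are mine):

```r
# forward Box-Cox transform
bc <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
# inverse transform, used to back-transform forecasts to the original scale
bc_inv <- function(z, lambda) {
  if (lambda == 0) exp(z) else (lambda * z + 1)^(1 / lambda)
}
# round-trips back to the original value
bc_inv(bc(91, 0.35), 0.35)
#> [1] 91
```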
Now I'll examine the transformed series and its ACF/PACF. I clearly see significant spikes at lags 7, 14, and 21 on the ACF plot, which makes sense: withdrawals likely follow a day-of-week pattern. I'll definitely need a model that incorporates this seasonality.
#examine the transformed series
ggtsdisplay(atm1_tf$Cash, main="ATM1 Cash after Box-Cox")
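To see why a weekly pattern shows up as ACF spikes at lags 7, 14, and 21, here is a synthetic series with a repeating 7-day shape (the pattern values are made up for illustration):

```r
set.seed(624)
pattern <- c(90, 95, 100, 105, 110, 60, 40)   # hypothetical weekly shape
y <- rep(pattern, 52) + rnorm(364, sd = 5)
a <- acf(y, lag.max = 21, plot = FALSE)
# autocorrelation is far stronger at lag 7 than at an off-cycle lag like 3
round(a$acf[c(4, 8)], 2)   # positions 4 and 8 hold lags 3 and 7
```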
A plot of the cleaned ATM2 data shows some changes in variance and certainly seasonality, similar to ATM1.
autoplot(atm2_ts, Cash)
In order to better regulate the variation in the plot above, I’ll perform a Box-Cox transformation. In this case lambda = 0.68. The plot still has plenty of seasonality, but the variance is more stable.
#calculate lambda
lambda2 <- atm2_ts %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
#plug into autoplot
ggplotly(atm2_ts %>%
autoplot(box_cox(Cash, lambda2)) +
labs(y = "",
title = "Transformed ATM2 Cash with lambda = 0.68"))
#apply transformation
atm2_tf <-
atm2_ts %>%
mutate(Cash = box_cox(Cash, lambda2))
Now I'll examine the transformed series and its ACF/PACF. Again I see spikes at multiples of lag 7, but also more prominent ones at lags 2, 5, and 9 that will need attention during modeling.
#examine the transformed series
ggtsdisplay(atm2_tf$Cash, main="ATM2 Cash after Box-Cox")
A plot of the cleaned ATM4 data shows seasonality, with some of the largest variance we've seen so far.
autoplot(atm4_ts, Cash)
In order to better regulate the variation in the plot above, I’ll perform a Box-Cox transformation. In this case lambda = 0.40. The plot still has plenty of seasonality, but the variance is more stable.
#calculate lambda
lambda4 <- atm4_ts %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
#plug into autoplot
ggplotly(atm4_ts %>%
autoplot(box_cox(Cash, lambda4)) +
labs(y = "",
title = "Transformed ATM4 Cash with lambda = 0.40"))
#apply transformation
atm4_tf <-
atm4_ts %>%
mutate(Cash = box_cox(Cash, lambda4))
Now I'll examine the transformed series and its ACF/PACF. I see spikes primarily at multiples of lag 7, without the extra intervals seen in ATM2.
#examine the transformed series
ggtsdisplay(atm4_tf$Cash, main="ATM4 Cash after Box-Cox")
It's clear that all of our ATMs will need models that incorporate their weekly seasonality. I'll try the simplest option, the seasonal naive method, first.
atm1_tf %>%
model(
`SNAIVE` = SNAIVE(Cash)) %>%
forecast(h = 15) %>%
autoplot(atm1_tf) +
labs(title = "ATM1 - Seasonal Naive Model",
y = "USD in Hundreds")
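Mechanically, the seasonal naive point forecast just recycles the last observed seasonal cycle. A base-R sketch of what SNAIVE() produces for the point forecasts (snaive_fc is a hypothetical helper, not the fable implementation):

```r
snaive_fc <- function(y, m, h) {
  # each horizon step pulls the value from one full season back,
  # recycling the last m observations as needed
  last_season <- tail(y, m)
  last_season[((seq_len(h) - 1) %% m) + 1]
}
snaive_fc(y = 1:14, m = 7, h = 10)
#> [1]  8  9 10 11 12 13 14  8  9 10
```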
atm1_tf %>%
  model(
    `Holt-Winters` = ETS(Cash ~ error("A") +
                           trend("A") + season("M"))) %>%
  forecast(h = 15) %>%
  autoplot(atm1_tf) +
  labs(title = "ATM1 - Holt-Winters Model",
       y = "USD in Hundreds")
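Before committing to one of these models, I'd want to compare candidates on a holdout window. A base-R sketch of the idea on synthetic weekly data (in practice, fable's accuracy() handles this properly; the series here is made up):

```r
set.seed(624)
pattern <- c(90, 95, 100, 105, 110, 60, 40)   # hypothetical weekly shape
y <- rep(pattern, 52) + rnorm(364, sd = 5)
# hold out the last four weeks
train <- y[1:336]; test <- y[337:364]
# candidate 1: seasonal naive - repeat the last observed week
fc_snaive <- rep(tail(train, 7), length.out = 28)
# candidate 2: flat forecast at the training mean
fc_mean <- rep(mean(train), 28)
mae <- function(e) mean(abs(e))
c(snaive = mae(test - fc_snaive), mean = mae(test - fc_mean))
```

On strongly seasonal data like this, the seasonal naive candidate should post the lower MAE, which is why it makes a sensible benchmark for the ATM series.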
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable 'KWH' is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Looking at a summary of the data, we see there are 3 variables: CaseSequence, which indexes the observations; YYYY-MMM, the month; and KWH, the power usage. KWH has one missing value we will want to impute before we proceed.
power_raw <- read_csv("power_raw.csv")
skim(power_raw)
| Name | power_raw |
| Number of rows | 192 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| YYYY-MMM | 0 | 1 | 8 | 8 | 0 | 192 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CaseSequence | 0 | 1.00 | 828.5 | 55.57 | 733 | 780.75 | 828.5 | 876.25 | 924 | ▇▇▇▇▇ |
| KWH | 1 | 0.99 | 6502474.6 | 1447570.89 | 770523 | 5429912.00 | 6283324.0 | 7620523.50 | 10655730 | ▁▁▇▅▁ |
#start cleaned dataset
power <- power_raw
#transform date
power <- power %>%
mutate(Month = yearmonth(`YYYY-MMM`)) %>%
select(-`YYYY-MMM`)
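tsibble's yearmonth() handles these "YYYY-MMM" strings directly. For reference, the same parse can be done in base R by appending a day and matching abbreviated month names with %b (a sketch; %b matching is locale-dependent):

```r
# e.g. "1998-Jan" -> Date by tacking on a day-of-month
as.Date(paste0("1998-Jan", "-01"), format = "%Y-%b-%d")
#> [1] "1998-01-01"
```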
Below I use the imputeTS package to generate a value for the identified NA, using the linear (default) method. The imputed point seems reasonable and consistent with the seasonality we see in the data, so I incorporate it.
#store imputation
imputation <- na_interpolation(power$KWH)
#plot imputation
ggplot_na_imputations(power$KWH, imputation) +
my_plot_theme
#incorporate imputation
power$KWH <- imputation
I start by visualizing the data with a boxplot and see there is one low outlier we may want to check on.
power %>%
subset(select = KWH) %>%
ggplot(aes(KWH)) +
geom_boxplot() +
labs(title = "Checking KWH Distribution") +
my_plot_theme
power %>%
subset(select = KWH) %>%
ggplot(aes(KWH)) +
geom_histogram() +
labs(title = "Checking KWH Distribution") +
my_plot_theme
The outlier shown above on the boxplot appears to be from July 2010. In a real-world setting I would ask coworkers whether there was a large outage for part of this month that decreased consumption, or possibly a measurement error. As I don't have that ability, and it's the only significant outlier, I'm going to replace it with the median.
power %>%
filter(KWH < 1000000)
## # A tibble: 1 x 3
## CaseSequence KWH Month
## <dbl> <dbl> <mth>
## 1 883 770523 2010 Jul
#set outlier to NA for now
power$KWH[power$KWH == 770523] <- NA
#calculate median w/o extreme outlier
power_median <- median(power$KWH, na.rm = TRUE)
#put median in for the NA value (note: x == NA never matches; use is.na())
power$KWH[is.na(power$KWH)] <- power_median
Checking the boxplot again the distribution appears much more normal with still a fair amount of variation.
ggplot(power, aes(x = KWH)) +
geom_boxplot() +
my_plot_theme
After converting our data to a tsibble, a quick check of the line graph shows we'll want to do a Box-Cox transformation to address the variance over time.
#transfer to tsibble
power_ts <- tsibble(
Month = yearmonth(power$Month),
KWH = power$KWH,
)
autoplot(power_ts)
The Box-Cox results in a lambda of -0.23 and we can see the transformation addresses the variance seen above.
#calculate lambda
lambda <- power_ts %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero)
#plug into autoplot
ggplotly(power_ts %>%
autoplot(box_cox(KWH, lambda)) +
labs(y = "",
title = "Transformed KWH with lambda = -0.23"))
#apply transformation
power_tf <-
power_ts %>%
mutate(KWH = box_cox(KWH, lambda))
The transformed series shows clear seasonality, with lower usage around month 4 (April) and higher usage around month 7 (July). This would make sense for data from a region in the USA.
#examine the transformed series
ggtsdisplay(power_tf$KWH, main="KWH after Box-Cox")